Abstract: The fundamental problem of multiple secondary 
users contending for opportunistic spectrum access over multiple 
channels in cognitive radio networks has been formulated re- 
cently as a decentralized multi-armed bandit (D-MAB) problem. 
In a D-MAB problem there are M users and N arms (channels) 
that each offer i.i.d. stochastic rewards with unknown means so 
long as they are accessed without collision. The goal is to design 
a decentralized online learning policy that incurs minimal regret, 
defined as the difference between the total expected rewards 
accumulated by a model-aware genie, and that obtained by all 
users applying the policy. We make two contributions in this 
paper. First, we consider the setting where the users have a 
prioritized ranking, such that it is desired for the K-th-ranked 
user to learn to access the arm offering the K-th highest mean 
reward. For this problem, we present the first distributed policy 
that yields regret that is uniformly logarithmic over time without 
requiring any prior assumption about the mean rewards. Second, 
we consider the case when a fair access policy is required, i.e., it is 
desired for all users to experience the same mean reward. For this 
problem, we present a distributed policy that yields order-optimal 
regret scaling with respect to the number of users and arms, 
better than previously proposed policies in the literature. Both 
of our distributed policies make use of an innovative modification 
of the well known UCB1 policy for the classic multi-armed bandit 
problem that allows a single user to learn how to play the arm 
that yields the K-th largest mean reward. 

I. Introduction 

Developing dynamic spectrum access mechanisms to enable 
more efficient spectrum utilization is one of the most 
challenging issues in cognitive radio systems [1]. In this paper, 
we focus on a problem of opportunistic spectrum access in 
cognitive radio networks, where at every time slot, each of 
the M decentralized secondary users searches for idle channels, 
i.e., channels that are temporarily not occupied by primary 
users, among N > M channels. We assume that the throughput of 
these N channels evolves i.i.d. over time with an arbitrary 
bounded-support distribution, which is unknown to the users. 
These distributed players can only learn from their local 
observations and collide (with reward penalty) when choosing the 
same arm. The desired objective is to develop a sequential policy 
running at each user to make a selection among multiple 
choices, where there is no information exchange, such that 
the sum-throughput of all distributed users is maximized, 
assuming an interference model whereby at most one secondary 
user can derive benefit from any channel. 



The multi-armed bandit problem (MAB, see [2]-[6]) is a 
fundamental mathematical framework for learning unknown 
variables. In its simplest form, the classic non-Bayesian version 
studied by Lai and Robbins [2], there are N arms, each 
providing stochastic rewards that are independent and identically 
distributed over time, with unknown means. A policy is desired 
to pick one arm at each time sequentially, to maximize the 
reward. Anantharam et al. [3] extend this work to the case 
when M simultaneous plays are allowed, with centralized 
scheduling of the players. 

A fundamental tradeoff between exploration and exploitation 
is captured by MAB problems: on the one hand, various 
arms should be explored often enough in order to learn their 
parameters, and on the other hand, the prior observations 
should be exploited to gain the best possible immediate 
rewards. A key metric in evaluating a given policy for this 
problem is regret, which is defined as the difference between 
the expected reward gained by a genie that always makes the 
optimal choice and that obtained by the given policy. The regret 
achieved by a policy can be evaluated in terms of its growth 
over time. Many of the prior works on multi-armed bandits 
show logarithmic scaling of the regret over time. 

While most of the prior work on MAB focused on 
centralized policies, motivated by the problem of opportunistic 
access in cognitive radio networks, Liu and Zhao [7], [8], and 
Anandkumar et al. [9], [10] have both developed policies for 
the problem of M distributed players operating N independent 
arms. There are two problem formulations of interest when 
considering distributed MAB: a) the prioritized access problem, 
where it is desired to prioritize a ranked set of users so 
that the K-th ranked user learns to access the arm with the 
K-th highest reward, and b) the fair access problem, where 
the goal is to ensure that each user receives the same reward in 
expectation. For the prioritized access problem, Anandkumar 
et al. present a distributed policy that yields regret that 
is logarithmic in time, but requires prior knowledge of the 
arm reward means. For the fair access problem, they propose 
in [9], [10] a randomized distributed policy that is logarithmic 
with respect to time and scales as $O(M^2 N)$ with respect to 
the number of arms and users. Liu and Zhao [7], [8] also treat 
the fair access problem and present the TDFS policy, which 
yields asymptotically logarithmic regret with respect to time 
and scales as $O(M \max\{M^2, (N - M)M\})$ with respect to 
the number of arms and users. 

In this paper we make significant new contributions to both 
problem formulations. For the prioritized access problem, we 
present a distributed learning policy DLP that results in a regret 
that is uniformly logarithmic in time and, unlike the prior work 
in [9], does not require any prior knowledge about the arm 
reward means. For the fair access problem, we present another 
distributed learning policy DLF, which yields regret that is also 
uniformly logarithmic in time and that scales as $O(M(N - M))$ 
with respect to the number of users M and the number 
of arms N. As it has been shown in [8] that the lower bound 
of regret for distributed policies also scales as $\Omega(M(N - M))$, 
this is not only a better scaling than the previous state of the 
art, it is, in fact, order-optimal. 

A key subroutine of both decentralized learning policies 
running at each user involves selecting an arm with the desired 
rank order of the mean reward. For this, we present a new 
policy that we refer to as SL(K), which is a non-trivial 
generalization of UCB1 in [5]. SL(K) provides a general 
solution for selecting an arm with the K-th largest expected 
reward for classic MAB problems with N arms. 

This paper is organized as follows. We present in section II 
the problem formulation. In section III we first present our 
SL(K) policy, which is a general policy to play an arm with the 
K-th largest expected reward for classic multi-armed bandits, 
and then present our decentralized DLP policy in section 
IV and DLF policy in section V, both based on the SL(K) policy. 
Both policies are polynomial-storage, polynomial-time-per-step 
learning policies. We show that the regrets of all policies we 
propose are logarithmic in time and polynomial in the number 
of users and channels, and we compare the upper bounds of 
the regrets of different policies. In section VI we compare the 
decentralized learning policies with simulation results. Finally, 
section VII concludes the paper. 

II. Problem Formulation 

We consider a cognitive system with N channels (arms) and 
M decentralized secondary users (players). The throughputs of 
the N channels are defined by random processes $X_i(n)$, 
$1 \le i \le N$. Time is slotted and denoted by the index n. We 
assume that $X_i(n)$ evolves as an i.i.d. random process over 
time, with the only restriction that its distribution has finite 
support. Without loss of generality, we normalize 
$X_i(n) \in [0, 1]$. We do not require that $X_i(n)$ be independent 
across i. This random process is assumed to have a mean 
$\theta_i = E[X_i]$ that is unknown to the users, and the means 
are distinct from each other. We denote the set of all these 
means as $\Theta = \{\theta_i, 1 \le i \le N\}$. 

At each decision period n (also referred to interchangeably 
as time slot), each of the M decentralized users selects an 
arm only based on its own observation histories under a 
decentralized policy. When a particular arm i is selected by 
user j, the value of $X_i(n)$ is only observed by user j, and if 
there is no other user playing the same arm, a reward of $X_i(n)$ 
is obtained. Otherwise, if there are multiple users playing the 
same arm, then we assume that, due to collision, at most one of 
the conflicting users j' gets reward $X_i(n)$, while the other 
users get zero reward. This interference assumption covers 
practical models in networking research, such as the perfect 
collision model (in which none of the conflicting users derives 
any benefit) and CSMA with perfect sensing (in which exactly 
one of the conflicting users derives benefit from the channel). 
We denote the first model as $M_1$ and the second model as 
$M_2$. 

We denote the decentralized policy for user j at time n as 
$\pi_j(n)$, and the set of policies for all users as 
$\pi = \{\pi_j(n), 1 \le j \le M\}$. We are interested in designing 
decentralized policies, under which there is no information 
exchange among users, and in analyzing them with respect to 
regret, which is defined as the gap between the expected reward 
that could be obtained by a genie-aided perfect selection and 
that obtained by the policy. We denote $O^*_M$ as the set of M 
arms with the M largest expected rewards. The regret can be 
expressed as: 

$$\mathfrak{R}^{\pi}(\Theta; n) = n \sum_{i \in O^*_M} \theta_i - E^{\pi}\Big[\sum_{t=1}^{n} S_{\pi(t)}(t)\Big] \tag{1}$$

where $S_{\pi(t)}(t)$ is the sum of the actual rewards obtained by 
all users at time t under policy $\pi(t)$, which can be expressed 
as: 

$$S_{\pi(t)}(t) = \sum_{i=1}^{N} \sum_{j=1}^{M} X_i(t) I_{i,j}(t) \tag{2}$$

where for $M_1$, $I_{i,j}(t)$ is defined to be 1 if user j is the only 
user to play arm i, and 0 otherwise; for $M_2$, $I_{i,j}(t)$ is defined 
to be 1 if user j is the one with the smallest index among all 
users playing arm i at time t, and 0 otherwise. Then, if we 
denote $V^{\pi}_{i,j}(n) = E^{\pi}[\sum_{t=1}^{n} I_{i,j}(t)]$, we have: 

$$\sum_{t=1}^{n} E^{\pi}[S_{\pi(t)}(t)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \theta_i E^{\pi}[V^{\pi}_{i,j}(n)] \tag{3}$$
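
To make the two interference models concrete, the following is a minimal sketch of how the per-slot rewards could be computed under the perfect collision model and under CSMA with perfect sensing. The function name `slot_rewards`, the 0-based indexing, and the model labels passed as strings are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict


def slot_rewards(choices, x, model="M1"):
    """Per-user rewards in one time slot (illustrative helper).

    choices[j] is the arm picked by user j (0-based) and x[i] is the
    realized throughput of arm i in this slot. Under "M1" all colliding
    users get zero; under "M2" only the colliding user with the smallest
    index is rewarded.
    """
    users_on_arm = defaultdict(list)
    for user, arm in enumerate(choices):
        users_on_arm[arm].append(user)

    rewards = [0.0] * len(choices)
    for arm, users in users_on_arm.items():
        if len(users) == 1:
            rewards[users[0]] = x[arm]      # no collision on this arm
        elif model == "M2":
            rewards[min(users)] = x[arm]    # smallest index wins
    return rewards


x = [0.9, 0.4, 0.7]
print(slot_rewards([0, 0, 2], x, "M1"))  # [0.0, 0.0, 0.7]
print(slot_rewards([0, 0, 2], x, "M2"))  # [0.9, 0.0, 0.7]
```

Summing the returned list gives the total reward collected by all users in that slot.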

Besides getting low total regret, there could be other system 
objectives for a given D-MAB. We consider two in this paper. 
In the prioritized access problem, we assume that each user 
has information of a distinct allocation order. Without loss 
of generality, we assume that the users are ranked in such a 
way that the m-th user seeks to access the arm with the m- 
th highest mean reward. In the fair access problem, users are 
treated equally to receive the same expected reward. 

III. Selective Learning of the K-th Largest 
Expected Reward 

We first propose a general policy to play an arm with the 
K-th largest expected reward ($1 \le K \le N$) for the classic 
multi-armed bandit problem with N arms and one user, since the 
key idea of our proposed decentralized policies running at each 
user in sections IV and V is that user m will run a learning 
policy targeting an arm with the m-th largest expected reward. 

Our proposed policy of learning an arm with the K-th largest 
expected reward is shown in Algorithm 1. 

We use two 1 by N vectors to store the information after 
we play an arm at each time slot. One is $(\hat\theta_i)_{1 \times N}$, 
in which $\hat\theta_i$ is the average (sample mean) of all the 
observed values of $X_i$ up to the current time slot (obtained 
through potentially different sets of arms over time). The other 
one is $(n_i)_{1 \times N}$, in which $n_i$ is the number of times 
that $X_i$ has been observed up to the current time slot. 

Algorithm 1 Selective learning of the K-th largest expected 
rewards (SL(K)) 

 1: // Initialization 
 2: for t = 1 to N do 
 3:   Let i = t and play arm i; 
 4:   $\hat\theta_i(t) = X_i(t)$; 
 5:   $n_i(t) = 1$; 
 6: end for 
 7: // Main loop 
 8: while 1 do 
 9:   t = t + 1; 
10:   Let the set $O_K$ contain the K arms with the K largest 
      values in (4): 
      $$\hat\theta_i(t-1) + \sqrt{\frac{2 \ln t}{n_i(t-1)}} \tag{4}$$
11:   Play arm k in $O_K$ such that 
      $$k = \arg\min_{i \in O_K} \hat\theta_i(t-1) - \sqrt{\frac{2 \ln t}{n_i(t-1)}} \tag{5}$$
12:   $\hat\theta_k(t) = \frac{\hat\theta_k(t-1) n_k(t-1) + X_k(t)}{n_k(t-1) + 1}$; 
13:   $n_k(t) = n_k(t-1) + 1$; 
14: end while 

Note that while we indicate the time index in Algorithm 1 
for notational clarity, it is not necessary to store the vectors 
from previous time steps while running the algorithm. So the 
SL(K) policy requires storage linear in N. 

Remark: The SL(K) policy generalizes UCB1 in [5] and 
presents a general way to pick an arm with the K-th largest 
expected reward for a classic multi-armed bandit problem 
with N arms (without the requirement of distinct expected 
rewards for different arms). 
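
The selection rule of SL(K) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the class name `SLK`, the helper `run`, the 0-based arm indices, and the Bernoulli reward model are all choices made here for the example.

```python
import math
import random


class SLK:
    """Sketch of the SL(K) index policy for one user and N arms."""

    def __init__(self, n_arms, k):
        self.n_arms = n_arms
        self.k = k                    # target rank, 1 <= k <= n_arms
        self.mean = [0.0] * n_arms    # sample means
        self.count = [0] * n_arms     # observation counts
        self.t = 0

    def select(self):
        self.t += 1
        if self.t <= self.n_arms:     # initialization: play each arm once
            return self.t - 1
        bonus = [math.sqrt(2.0 * math.log(self.t) / c) for c in self.count]
        upper = [m + b for m, b in zip(self.mean, bonus)]
        # O_K: the K arms with the largest upper confidence indices
        order = sorted(range(self.n_arms), key=lambda i: upper[i], reverse=True)
        top_k = order[: self.k]
        # within O_K, play the arm with the smallest lower confidence index
        return min(top_k, key=lambda i: self.mean[i] - bonus[i])

    def update(self, arm, reward):
        c = self.count[arm]
        self.mean[arm] = (self.mean[arm] * c + reward) / (c + 1)
        self.count[arm] = c + 1


def run(theta, k, horizon, seed=0):
    """Play Bernoulli arms with means theta and return the play counts."""
    rng = random.Random(seed)
    policy = SLK(len(theta), k)
    for _ in range(horizon):
        arm = policy.select()
        policy.update(arm, 1.0 if rng.random() < theta[arm] else 0.0)
    return policy.count


if __name__ == "__main__":
    print(run([0.9, 0.8, 0.7, 0.6], k=2, horizon=20000))
```

With K = 1 the set $O_K$ contains a single arm, so the rule reduces to playing the arm with the largest upper index, which is the UCB1 choice.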

Now we present the analysis of the upper bound of regret, 
and show that it is linear in N and logarithmic in time. We 
denote $A_K$ as the set of arms with the K-th largest expected 
reward. Note that Algorithm 1 is a general algorithm for 
picking an arm with the K-th largest expected reward for the 
classic multi-armed bandit problems, where we allow multiple 
arms with the K-th largest expected reward, and all these arms 
are treated as optimal arms. The following theorem holds for 
Algorithm 1. 

Theorem 1: Under the policy specified in Algorithm 1, the 
expected number of times that we pick any arm $i \notin A_K$ after 
n time slots is at most: 

$$\frac{8 \ln n}{\Delta_{K,i}^2} + 1 + \frac{2\pi^2}{3} \tag{6}$$

where $\Delta_{K,i} = |\theta_K - \theta_i|$ and $\theta_K$ is the K-th 
largest expected reward. 



Proof: Denote $T_i(n)$ as the number of times that we pick 
arm $i \notin A_K$ up to time n. Denote $C_{t,n_i}$ as 
$\sqrt{\frac{2 \ln t}{n_i}}$. Denote $\hat\theta_{i,n_i}$ as the average 
(sample mean) of all the observed values of $X_i$ when it is 
observed $n_i$ times. $O^*_K$ denotes the set of K arms with the 
K largest expected rewards. 

Denote by $I_i(n)$ the indicator function which is equal to 
1 if $T_i(n)$ is increased by one at time n. Let l be an arbitrary 
positive integer. Then, for any arm i which is not a desired 
arm, i.e., $i \notin A_K$: 



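$$\begin{aligned}
T_i(n) &= 1 + \sum_{t=N+1}^{n} \mathbb{1}\{I_i(t)\} \\
&\le l + \sum_{t=N+1}^{n} \mathbb{1}\{I_i(t), T_i(t-1) \ge l\} \\
&\le l + \sum_{t=N+1}^{n} \big(\mathbb{1}\{I_i(t), \theta_i < \theta_K, T_i(t-1) \ge l\} \\
&\qquad\qquad + \mathbb{1}\{I_i(t), \theta_i > \theta_K, T_i(t-1) \ge l\}\big)
\end{aligned} \tag{7}$$

where $\mathbb{1}(x)$ is the indicator function defined to be 1 when the 
predicate x is true, and 0 when it is false. 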

Note that for the case $\theta_i < \theta_K$, arm i being picked at 
time t means that there exists an arm $j(t) \in O^*_K$ such that 
$j(t) \notin O_K$. This means the following inequality holds: 

$$\hat\theta_{j(t),T_{j(t)}(t-1)} + C_{t-1,T_{j(t)}(t-1)} \le \hat\theta_{i,T_i(t-1)} + C_{t-1,T_i(t-1)}. \tag{8}$$

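Then, we have 

$$\begin{aligned}
&\sum_{t=N+1}^{n} \mathbb{1}\{I_i(t), \theta_i < \theta_K, T_i(t-1) \ge l\} \\
&\le \sum_{t=N+1}^{n} \mathbb{1}\{\hat\theta_{j(t),T_{j(t)}(t-1)} + C_{t-1,T_{j(t)}(t-1)} \le \hat\theta_{i,T_i(t-1)} + C_{t-1,T_i(t-1)}, T_i(t-1) \ge l\} \\
&\le \sum_{t=N+1}^{n} \mathbb{1}\Big\{\min_{0 < n_{j(t)} < t} \big(\hat\theta_{j(t),n_{j(t)}} + C_{t-1,n_{j(t)}}\big) \le \max_{l \le n_i < t} \big(\hat\theta_{i,n_i} + C_{t-1,n_i}\big)\Big\} \\
&\le \sum_{t=1}^{\infty} \sum_{n_{j(t)}=1}^{t-1} \sum_{n_i=l}^{t-1} \mathbb{1}\{\hat\theta_{j(t),n_{j(t)}} + C_{t,n_{j(t)}} \le \hat\theta_{i,n_i} + C_{t,n_i}\}
\end{aligned} \tag{9}$$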

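$\hat\theta_{j(t),n_{j(t)}} + C_{t,n_{j(t)}} \le \hat\theta_{i,n_i} + C_{t,n_i}$ 
implies that at least one of the following must be true: 

$$\hat\theta_{j(t),n_{j(t)}} \le \theta_{j(t)} - C_{t,n_{j(t)}}, \tag{10}$$

$$\hat\theta_{i,n_i} \ge \theta_i + C_{t,n_i}, \tag{11}$$

$$\theta_{j(t)} < \theta_i + 2C_{t,n_i}. \tag{12}$$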

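Applying the Chernoff-Hoeffding bound [11], we can find 
the upper bounds of (10) and (11) as 

$$\Pr\{\hat\theta_{j(t),n_{j(t)}} \le \theta_{j(t)} - C_{t,n_{j(t)}}\} \le e^{-4 \ln t} = t^{-4}, \tag{13}$$

$$\Pr\{\hat\theta_{i,n_i} \ge \theta_i + C_{t,n_i}\} \le e^{-4 \ln t} = t^{-4}. \tag{14}$$

For $l \ge \lceil 8 \ln n / \Delta_{K,i}^2 \rceil$, 

$$\theta_{j(t)} - \theta_i - 2C_{t,n_i} \ge \theta_K - \theta_i - 2\sqrt{\frac{2 \Delta_{K,i}^2 \ln t}{8 \ln n}} \ge \theta_K - \theta_i - \Delta_{K,i} = 0, \tag{15}$$

so (12) is false when $l \ge \lceil 8 \ln n / \Delta_{K,i}^2 \rceil$. 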
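
For readers who want the intermediate step behind the $t^{-4}$ tail bounds: since the rewards lie in [0, 1], Hoeffding's inequality for the mean of $n_i$ i.i.d. samples gives a tail of $e^{-2 n_i a^2}$ for a deviation of size a. Substituting $a = C_{t,n_i} = \sqrt{2 \ln t / n_i}$ cancels the dependence on $n_i$:

```latex
\Pr\{\hat\theta_{i,n_i} \ge \theta_i + C_{t,n_i}\}
  \le \exp\!\left(-2 n_i C_{t,n_i}^2\right)
  = \exp\!\left(-2 n_i \cdot \frac{2 \ln t}{n_i}\right)
  = e^{-4 \ln t}
  = t^{-4}.
```

The lower-tail bound follows in the same way with the inequality reversed.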

Note that for the case $\theta_i > \theta_K$, when arm i is picked at 
time t, there are two possibilities: either $O_K = O^*_K$, or 
$O_K \ne O^*_K$. If $O_K = O^*_K$, the following inequality holds: 

$$\hat\theta_{i,T_i(t-1)} - C_{t-1,T_i(t-1)} \le \hat\theta_{K,T_K(t-1)} - C_{t-1,T_K(t-1)}.$$

If $O_K \ne O^*_K$, $O_K$ has at least one arm $h(t) \notin O^*_K$. 
Then, we have: 

$$\hat\theta_{i,T_i(t-1)} - C_{t-1,T_i(t-1)} \le \hat\theta_{h(t),T_{h(t)}(t-1)} - C_{t-1,T_{h(t)}(t-1)}.$$

So to conclude both possibilities for the case $\theta_i > \theta_K$, 
if we denote $O^*_{K-1} = O^*_K - A_K$, at each time t when arm i is 
picked, there exists an arm $h(t) \notin O^*_{K-1}$ such that 

$$\hat\theta_{i,T_i(t-1)} - C_{t-1,T_i(t-1)} \le \hat\theta_{h(t),T_{h(t)}(t-1)} - C_{t-1,T_{h(t)}(t-1)}. \tag{16}$$

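Then similarly, we can have: 

$$\begin{aligned}
&\sum_{t=N+1}^{n} \mathbb{1}\{I_i(t), \theta_i > \theta_K, T_i(t-1) \ge l\} \\
&\le \sum_{t=1}^{\infty} \sum_{n_i=l}^{t-1} \sum_{n_{h(t)}=1}^{t-1} \mathbb{1}\{\hat\theta_{i,n_i} - C_{t,n_i} \le \hat\theta_{h(t),n_{h(t)}} - C_{t,n_{h(t)}}\}
\end{aligned} \tag{17}$$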



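Note that $\hat\theta_{i,n_i} - C_{t,n_i} \le \hat\theta_{h(t),n_{h(t)}} - C_{t,n_{h(t)}}$ 
implies that at least one of the following must be true: 

$$\hat\theta_{i,n_i} \le \theta_i - C_{t,n_i}, \tag{18}$$

$$\hat\theta_{h(t),n_{h(t)}} \ge \theta_{h(t)} + C_{t,n_{h(t)}}, \tag{19}$$

$$\theta_i < \theta_{h(t)} + 2C_{t,n_i}. \tag{20}$$

We again apply the Chernoff-Hoeffding bound and get 
$\Pr\{\hat\theta_{i,n_i} \le \theta_i - C_{t,n_i}\} \le t^{-4}$ and 
$\Pr\{\hat\theta_{h(t),n_{h(t)}} \ge \theta_{h(t)} + C_{t,n_{h(t)}}\} \le t^{-4}$. 

Also note that for $l \ge \lceil 8 \ln n / \Delta_{K,i}^2 \rceil$, 

$$\theta_i - \theta_{h(t)} - 2C_{t,n_i} \ge \theta_i - \theta_K - \Delta_{K,i} \ge 0, \tag{21}$$

so (20) is false. 

Hence, we have 

$$\begin{aligned}
E[T_i(n)] &\le \Big\lceil \frac{8 \ln n}{\Delta_{K,i}^2} \Big\rceil
+ \sum_{t=1}^{\infty} \sum_{n_{j(t)}=1}^{t-1} \sum_{n_i=\lceil (8 \ln n)/\Delta_{K,i}^2 \rceil}^{t-1}
\big(\Pr\{\hat\theta_{j(t),n_{j(t)}} \le \theta_{j(t)} - C_{t,n_{j(t)}}\} + \Pr\{\hat\theta_{i,n_i} \ge \theta_i + C_{t,n_i}\}\big) \\
&\quad + \sum_{t=1}^{\infty} \sum_{n_i=\lceil (8 \ln n)/\Delta_{K,i}^2 \rceil}^{t-1} \sum_{n_{h(t)}=1}^{t-1}
\big(\Pr\{\hat\theta_{i,n_i} \le \theta_i - C_{t,n_i}\} + \Pr\{\hat\theta_{h(t),n_{h(t)}} \ge \theta_{h(t)} + C_{t,n_{h(t)}}\}\big) \\
&\le \frac{8 \ln n}{\Delta_{K,i}^2} + 1 + 2 \sum_{t=1}^{\infty} \sum_{n_{j(t)}=1}^{t-1} \sum_{n_i=1}^{t-1} 2t^{-4} \\
&\le \frac{8 \ln n}{\Delta_{K,i}^2} + 1 + \frac{2\pi^2}{3}.
\end{aligned} \tag{22}$$

The definition of regret for the above problem is different 
from the traditional multi-armed bandit problem with the goal 
of maximization or minimization, since our goal now is to 
pick the arm with the K-th largest expected reward and we 
wish to minimize the number of times that we pick 
the wrong arm. Here we give two definitions of the regret to 
evaluate the SL(K) policy. 

Definition 1: We define the regret of type 1 at each time 
slot as the absolute difference between the expected reward 
that could be obtained by a genie that can pick an arm with the 
K-th largest expected reward, and that obtained by the given 
policy at each time slot. Then the total regret of type 1 by time 
n is defined as the sum of the regret at each time slot, which is: 

$$\mathfrak{R}^{\pi}_1(\Theta; n) = \sum_{t=1}^{n} \big|\theta_K - E^{\pi}[S_{\pi(t)}(t)]\big| \tag{23}$$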



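Definition 2: We define the total regret of type 2 by time 
n as the absolute difference between the expected reward that 
could be obtained by a genie that can pick an arm with the K-th 
largest expected reward, and that obtained by the given policy 
after n plays, which is: 

$$\mathfrak{R}^{\pi}_2(\Theta; n) = \Big| n\theta_K - E^{\pi}\Big[\sum_{t=1}^{n} S_{\pi(t)}(t)\Big] \Big| \tag{24}$$

Here we note that $\forall n$, $\mathfrak{R}^{\pi}_2(\Theta; n) \le \mathfrak{R}^{\pi}_1(\Theta; n)$, because 
$|n\theta_K - E^{\pi}[\sum_{t=1}^{n} S_{\pi(t)}(t)]| = |n\theta_K - \sum_{t=1}^{n} E^{\pi}[S_{\pi(t)}(t)]| \le \sum_{t=1}^{n} |\theta_K - E^{\pi}[S_{\pi(t)}(t)]|$. 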

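Corollary 1: The expected regret under both definitions is 
at most 

$$\sum_{i: i \notin A_K} \frac{8 \ln n}{\Delta_{K,i}} + \Big(1 + \frac{2\pi^2}{3}\Big) \sum_{i: i \notin A_K} \Delta_{K,i}. \tag{25}$$

Proof: Under the SL(K) policy, we have: 

$$\begin{aligned}
\mathfrak{R}^{\pi}_2(\Theta; n) &\le \mathfrak{R}^{\pi}_1(\Theta; n) \\
&= \sum_{t=1}^{n} \big|\theta_K - E^{\pi}[S_{\pi(t)}(t)]\big| \\
&= \sum_{i: i \notin A_K} \Delta_{K,i} E[T_i(n)] \\
&\le \sum_{i: i \notin A_K} \frac{8 \ln n}{\Delta_{K,i}} + \Big(1 + \frac{2\pi^2}{3}\Big) \sum_{i: i \notin A_K} \Delta_{K,i}.
\end{aligned} \tag{26}$$

Corollary 1 gives an upper bound on the regret of the SL(K) 
policy. It grows logarithmically in time and linearly in the 
number of arms. 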
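
As a worked illustration of how the bound behaves numerically, the sketch below evaluates the per-arm bound of Theorem 1 and sums it, weighted by the gaps, as in Corollary 1. The function names `pick_bound` and `slk_regret_bound` are hypothetical, and the code assumes the arm means are given exactly so that ties with the K-th mean can be detected by equality.

```python
import math


def pick_bound(n, gap):
    """Bound on the expected number of picks of a wrong arm with gap Delta_{K,i}."""
    return 8.0 * math.log(n) / gap**2 + 1.0 + 2.0 * math.pi**2 / 3.0


def slk_regret_bound(theta, k, n):
    """Sum over wrong arms of gap * pick_bound, i.e. the Corollary 1 expression."""
    theta_k = sorted(theta, reverse=True)[k - 1]
    gaps = [abs(theta_k - th) for th in theta if th != theta_k]
    return sum(g * pick_bound(n, g) for g in gaps)


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, round(slk_regret_bound([0.9, 0.8, 0.7, 0.6], 2, n), 1))
```

Because each term is $8 \ln n / \Delta_{K,i}$ plus a constant, increasing n tenfold adds the same fixed amount to the bound each time.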

IV. Distributed Learning with Prioritization 

We now consider the distributed multi-armed bandit problem 
with prioritized access. Our proposed decentralized policy 
for N arms with M users is shown in Algorithm 2. 

Algorithm 2 Distributed Learning Algorithm with Prioritization 
for N Arms with M Users Running at User m (DLP) 

 1: // Initialization 
 2: for t = 1 to N do 
 3:   Play arm k such that k = ((m + t) mod N) + 1; 
 4:   $\hat\theta^m_k(t) = X_k(t)$; 
 5:   $n^m_k(t) = 1$; 
 6: end for 
 7: // Main loop 
 8: while 1 do 
 9:   t = t + 1; 
10:   Play an arm k according to policy SL(m) specified in 
      Algorithm 1; 
11:   $\hat\theta^m_k(t) = \frac{\hat\theta^m_k(t-1) n^m_k(t-1) + X_k(t)}{n^m_k(t-1) + 1}$; 
12:   $n^m_k(t) = n^m_k(t-1) + 1$; 
13: end while 

In the above algorithm, lines 2 to 6 are the initialization part, 
in which user m plays each arm once to obtain the initial 
values in $(\hat\theta^m_i)_{1 \times N}$ and $(n^m_i)_{1 \times N}$. 
Line 3 ensures that there will be no collisions among users. As in 
Algorithm 1, we indicate the time index for notational clarity. 
Only two 1 by N vectors, $(\hat\theta^m_i)_{1 \times N}$ and 
$(n^m_i)_{1 \times N}$, are used by user m to 
store the information after we play an arm at each time slot. 
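
The claim that the staggered initialization avoids collisions can be checked directly. The sketch below uses the rule k = ((m + t) mod N) + 1 from the algorithm; the helper names `init_arm` and `collision_free` are made up for this example, and the check assumes M is at most N.

```python
def init_arm(m, t, n_arms):
    """Arm (1-based) played by user m at initialization slot t."""
    return ((m + t) % n_arms) + 1


def collision_free(n_users, n_arms):
    """True if no two users pick the same arm in any initialization slot."""
    for t in range(1, n_arms + 1):
        arms = [init_arm(m, t, n_arms) for m in range(1, n_users + 1)]
        if len(set(arms)) != n_users:
            return False
    return True


if __name__ == "__main__":
    print(collision_free(3, 5))
    print([init_arm(2, t, 5) for t in range(1, 6)])
```

Each user also visits every arm exactly once over the N initialization slots, since the offsets m + t cover all residues modulo N.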

We denote $o^*_m$ as the index of the arm with the m-th largest 
expected reward. Note that $\{o^*_m\}_{1 \le m \le M} = O^*_M$. Denote 
$\Delta_{i,j} = |\theta_i - \theta_j|$ for arms i, j. We now state the 
main theorem of this section. 

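Theorem 2: The expected regret under the DLP policy 
specified in Algorithm 2 is at most 

$$\sum_{m=1}^{M} \sum_{i \ne o^*_m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_m,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}
+ \sum_{m=1}^{M} \sum_{h \ne m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_h,o^*_m}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}. \tag{27}$$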



Proof: Denote $T_{i,m}(n)$ as the number of times that user m 
picks arm i up to time n. 

For each user m, the regret under the DLP policy can arise 
due to two possibilities: (1) user m plays an arm $i \ne o^*_m$; (2) 
another user $h \ne m$ plays arm $o^*_m$. In both cases, collisions may 
happen, resulting in a loss which is at most $\theta_{o^*_m}$. Considering 
these two possibilities, the regret of user m is upper bounded 
by: 

$$\mathfrak{R}^{\pi}(\Theta, m; n) \le \sum_{i \ne o^*_m} E[T_{i,m}(n)] \theta_{o^*_m} + \sum_{h \ne m} E[T_{o^*_m,h}(n)] \theta_{o^*_m}. \tag{28}$$

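From Theorem 1, $T_{i,m}(n)$ and $T_{o^*_m,h}(n)$ are bounded by 

$$E[T_{i,m}(n)] \le \frac{8 \ln n}{\Delta^2_{o^*_m,i}} + 1 + \frac{2\pi^2}{3}, \tag{29}$$

$$E[T_{o^*_m,h}(n)] \le \frac{8 \ln n}{\Delta^2_{o^*_h,o^*_m}} + 1 + \frac{2\pi^2}{3}. \tag{30}$$

So, 

$$\mathfrak{R}^{\pi}(\Theta, m; n) \le \sum_{i \ne o^*_m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_m,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}
+ \sum_{h \ne m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_h,o^*_m}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}. \tag{31}$$

The upper bound for regret is: 

$$\begin{aligned}
\mathfrak{R}^{\pi}(\Theta; n) &= \sum_{m=1}^{M} \mathfrak{R}^{\pi}(\Theta, m; n) \\
&\le \sum_{m=1}^{M} \sum_{i \ne o^*_m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_m,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}
+ \sum_{m=1}^{M} \sum_{h \ne m} \Big(\frac{8 \ln n}{\Delta^2_{o^*_h,o^*_m}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}.
\end{aligned} \tag{32}$$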



If we define $\Delta_{\min} = \min_{1 \le i \le N, 1 \le j \le M} \Delta_{i,j}$ and 
$\theta_{\max} = \max_{1 \le i \le N} \theta_i$, we could get a more concise 
(but looser) upper bound as: 

$$\mathfrak{R}^{\pi}(\Theta; n) \le M(N + M - 2)\Big(\frac{8 \ln n}{\Delta_{\min}^2} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}. \tag{33}$$

Theorem 2 shows that the regret of our DLP algorithm is 
uniformly upper-bounded for all time n by a function that 
grows as $O(M(N + M) \ln n)$. 
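
To get a feel for the constants in the concise DLP bound, $M(N + M - 2)(8 \ln n / \Delta_{\min}^2 + 1 + 2\pi^2/3)\theta_{\max}$, it can be evaluated directly. The function name `dlp_regret_bound` and its argument names are hypothetical; the numbers below are only an illustration using the channel means from the simulation section.

```python
import math


def dlp_regret_bound(n, n_arms, n_users, delta_min, theta_max):
    """Concise (looser) upper bound on the DLP regret after n slots."""
    per_term = 8.0 * math.log(n) / delta_min**2 + 1.0 + 2.0 * math.pi**2 / 3.0
    return n_users * (n_arms + n_users - 2) * per_term * theta_max


if __name__ == "__main__":
    # N = 4 channels, M = 2 users, means 0.9, 0.8, 0.7, 0.6
    for n in (10**4, 10**5, 10**6):
        print(n, round(dlp_regret_bound(n, 4, 2, 0.1, 0.9)))
```

The bound is a worst-case guarantee; it says nothing about how close the regret observed in a particular run comes to it.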

V. Distributed Learning with Fairness 

For the purpose of fairness, secondary users 
should be treated equally, and there should be no prioritization 
among the users. In this scenario, a naive algorithm is to apply 
Algorithm 2 directly by rotating the prioritization as shown in 
Figure 1. Each user maintains two M by N matrices 
$(\hat\theta^m_{j,i})_{M \times N}$ and $(n^m_{j,i})_{M \times N}$, where the j-th 
row stores only the observations made under the j-th 
prioritization vector. This naive algorithm is shown in 
Algorithm 3. 

Fig. 1. Illustration of rotating the prioritization vector. 

Algorithm 3 A Naive Algorithm for Distributed Learning 
with Fairness (DLF-Naive) Running at User m 

1: At time t, run Algorithm 2 with prioritization 
   K = ((m + t) mod M) + 1, then update the K-th row of 
   $(\hat\theta^m_{j,i})_{M \times N}$ and $(n^m_{j,i})_{M \times N}$ accordingly. 



We can see that the storage of Algorithm 3 grows linearly in 
MN, instead of N. It also does not utilize the observations 
made under different allocation orders, which results in a worse 
regret, as shown in the analysis of this section. To utilize all the 
observations, we propose our distributed learning algorithm 
with fairness (DLF) in Algorithm 4. 



Algorithm 4 Distributed Learning Algorithm with Fairness 
for N Arms with M Users Running at User m (DLF) 

 1: // Initialization 
 2: for t = 1 to N do 
 3:   Play arm k such that k = ((m + t) mod N) + 1; 
 4:   $\hat\theta^m_k(t) = X_k(t)$; 
 5:   $n^m_k(t) = 1$; 
 6: end for 
 7: // Main loop 
 8: while 1 do 
 9:   t = t + 1; 
10:   K = ((m + t) mod M) + 1; 
11:   Play an arm k according to policy SL(K) specified in 
      Algorithm 1; 
12:   $\hat\theta^m_k(t) = \frac{\hat\theta^m_k(t-1) n^m_k(t-1) + X_k(t)}{n^m_k(t-1) + 1}$; 
13:   $n^m_k(t) = n^m_k(t-1) + 1$; 
14: end while 



As in Algorithm 2, only two 1 by N vectors, 
$(\hat\theta^m_i)_{1 \times N}$ and $(n^m_i)_{1 \times N}$, are used by user m 
to store the information after we play an arm at each time slot. 

Line 11 in Algorithm 4 means that user m plays the arm with 
the K-th largest expected reward using Algorithm 1, where the 
value of K is calculated in line 10 to ensure that the desired arm 
is different for each user, and the users play arms 
from the estimated largest to the estimated smallest in turns 
to ensure fairness. 
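
The rotation K = ((m + t) mod M) + 1 can be checked with a few lines of Python. The helper name `dlf_rank` is an assumption made for this sketch; the loop simply prints the target ranks of all users for a few consecutive slots.

```python
def dlf_rank(m, t, n_users):
    """Target rank K of user m (1-based) at slot t: K = ((m + t) mod M) + 1."""
    return ((m + t) % n_users) + 1


if __name__ == "__main__":
    M = 3
    for t in range(1, 7):
        print(t, [dlf_rank(m, t, M) for m in range(1, M + 1)])
```

In every slot the M users hold M distinct ranks, and over any M consecutive slots each user holds every rank exactly once, which is what makes the expected rewards equal across users.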

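Theorem 3: The expected regret under the DLF-Naive policy 
specified in Algorithm 3 is at most 

$$\sum_{o^*_k \in O^*_M} \sum_{m=1}^{M} \sum_{i \ne o^*_m} \Big(\frac{8 \ln \lceil n/M \rceil}{\Delta^2_{o^*_m,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}
+ \sum_{o^*_k \in O^*_M} \sum_{m=1}^{M} \sum_{h \ne m} \Big(\frac{8 \ln \lceil n/M \rceil}{\Delta^2_{o^*_h,o^*_m}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{o^*_m}. \tag{34}$$

Proof: Theorem 3 is a direct conclusion from Theorem 
2 by replacing n with $\lceil n/M \rceil$, and then taking the sum over all 
M best arms which are played in the algorithm. ■ 

The above theorem shows that the regret of the DLF-Naive 
policy grows as $O(M^2(N + M) \ln n)$. 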

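Theorem 4: The expected regret under the DLF policy 
specified in Algorithm 4 is at most 

$$M \sum_{i=1}^{N} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M(M - 1) \sum_{i \in O^*_M} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_i \tag{35}$$

where $\Delta_{\min,i} = \min_{1 \le m \le M} \Delta_{o^*_m,i}$. 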

l<m<M m 

Proof: 

Denote $K^*_m(t)$ as the index of the arm with the K-th largest 
expected reward, where K is the value obtained in line 10 at time 
t in Algorithm 4 running at user m. Denote $Q^m_i(n)$ as the number 
of times that user m picks arm $i \ne K^*_m(t)$ for $1 \le t \le n$. 

We notice that for any arbitrary positive integer l and any 
time t, $Q^m_i(t) \ge l$ implies $n_i(t) \ge l$. So the arguments from (7) 
onward in the proof of Theorem 1 still hold by replacing $T_i(n)$ 
with $Q^m_i(n)$ and replacing K with $K^*_m(t)$. Note that since all 
the channels have different rewards, there is only one element 
in the set $A_{K^*_m(t)}$. 

To find the upper bound of $E[Q^m_i(n)]$, we should let l be 
$l \ge \lceil 8 \ln n / \Delta^2_{\min,i} \rceil$ such that (12) and (20) are false 
for all t. So 

we have, 

$$E[Q^m_i(n)] \le \frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}. \tag{36}$$

Hence for user m, we could calculate the upper bound 
of regret considering the two possibilities as in the proof of 
Theorem 2 as: 

$$\mathfrak{R}^{\pi}(\Theta, m; n) \le \sum_{i=1}^{N} Q^m_i(n) \theta_{\max} + \sum_{h \ne m} \sum_{i \in O^*_M} Q^h_i(n) \theta_i. \tag{37}$$

So the upper bound for regret for M users is: 

$$\mathfrak{R}^{\pi}(\Theta; n) \le M \sum_{i=1}^{N} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M(M - 1) \sum_{i \in O^*_M} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_i.$$

To be more concise, we could also write the above upper 
bound as: 

$$\mathfrak{R}^{\pi}(\Theta; n) \le M\big(N + M(M - 1)\big)\Big(\frac{8 \ln n}{\Delta_{\min}^2} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}.$$

Theorem 5: When time n is large enough such that 

$$\frac{n}{\ln n} \ge \frac{8(N + M)}{\Delta_{\min}^2} + \Big(1 + \frac{2\pi^2}{3}\Big) N + M,$$

the expected regret under the DLF policy specified in 
Algorithm 4 is at most 

$$M \sum_{i \notin O^*_M} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M^2 \Big(1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M(M - 1)\Big(1 + \frac{2\pi^2}{3}\Big) \sum_{i \in O^*_M} \theta_i.$$

Proof: The inequality (36) implies that the total number 
of times that the desired arms are picked by user m up to time 
n is lower bounded by 
$n - \sum_{i=1}^{N} \big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\big)$. 
Since all the arms with the M largest expected rewards are picked 
in turn by the algorithm, $\forall i \in O^*_M$, we have 

$$n_i(n) \ge \frac{1}{M}\Big(n - \sum_{i=1}^{N} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big)\Big)$$

where $n_i(n)$ refers to the number of times that arm i has 
been observed up to time n at user m. (For the purpose of 
simplicity, we omit m in the notation of $n_i$.) 

Note that when n is big enough such that 
$\frac{n}{\ln n} \ge \frac{8(N + M)}{\Delta_{\min}^2} + \big(1 + \frac{2\pi^2}{3}\big) N + M$, 
we have 

$$\frac{1}{M}\Big(n - \sum_{i=1}^{N} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big)\Big) \ge \Big\lceil \frac{8 \ln n}{\Delta_{\min}^2} \Big\rceil. \tag{43}$$

When (43) holds, both (12) and (20) are false. Then 
$\forall i \in O^*_M$, when n is large enough to satisfy (43), 

$$E[Q^m_i(n)] \le 1 + \frac{2\pi^2}{3}.$$

So when (43) is satisfied, a tighter bound for the regret is: 

$$\mathfrak{R}^{\pi}(\Theta; n) \le M \sum_{i \notin O^*_M} \Big(\frac{8 \ln n}{\Delta^2_{\min,i}} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M^2 \Big(1 + \frac{2\pi^2}{3}\Big) \theta_{\max}
+ M(M - 1)\Big(1 + \frac{2\pi^2}{3}\Big) \sum_{i \in O^*_M} \theta_i.$$

We could also write a concise (but looser) upper bound as: 

$$\mathfrak{R}^{\pi}(\Theta; n) \le M(N - M)\Big(\frac{8 \ln n}{\Delta_{\min}^2} + 1 + \frac{2\pi^2}{3}\Big) \theta_{\max} + M^3 \Big(1 + \frac{2\pi^2}{3}\Big) \theta_{\max}.$$

Comparing Theorem 3 with Theorem 4 and Theorem 5, 
if we define $C = \frac{8(N + M)}{\Delta_{\min}^2} + \big(1 + \frac{2\pi^2}{3}\big) N + M$, 
we can see that the regret of the naive policy DLF-Naive grows as 
$O(M^2(N + M) \ln n)$, while the regret of the DLF policy grows 
as $O(M(N + M^2) \ln n)$ when $n / \ln n < C$, and as 
$O(M(N - M) \ln n)$ when $n / \ln n \ge C$. 

VI. Numerical Results 

We present simulation results for the algorithms developed 
in this work, varying the number of users and channels to 
verify the performance of our proposed algorithms detailed 
earlier. In the simulations, we assume channels are in either 
idle state (with throughput 1) or busy state (with throughput 
0). The state of each of the N channels evolves as an i.i.d. 
Bernoulli process across time slots, with the parameter set 
$\Theta$ unknown to the M users. 

Figure 2 shows the simulation results averaged over 50 
runs using the three algorithms, DLP, DLF-Naive, and DLF, 
and the regrets are compared. Figure 2(a) shows the 
simulations for N = 4 channels, M = 2 users, with 
$\Theta$ = (0.9, 0.8, 0.7, 0.6). In Figure 2(b) we have N = 5 channels, 
M = 3 users, and $\Theta$ = (0.9, 0.8, 0.7, 0.6, 0.5). In Figure 
2(c) there are N = 7 channels and M = 4 users, with 
$\Theta$ = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3). 

Fig. 2. Normalized regret vs. n time slots (curves: DLF, DLP, 
DLF-Naive). (a) N = 4 channels, M = 2 secondary users, 
$\Theta$ = (0.9, 0.8, 0.7, 0.6). (b) N = 5 channels, M = 3 secondary 
users, $\Theta$ = (0.9, 0.8, 0.7, 0.6, 0.5). (c) N = 7 channels, M = 4 
secondary users, $\Theta$ = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3). 

Fig. 3. Number of times that channel i has been chosen by user 
m up to time $n = 10^6$, with N = 5 channels, M = 3 secondary 
users and $\Theta$ = (0.9, 0.8, 0.7, 0.6, 0.5). (a) DLP policy. (b) DLF 
policy. (c) DLF-Naive policy. 
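
A setup of this kind can be sketched in a short, self-contained simulation. The code below is an illustrative approximation and not the simulation code behind the figures: the function names `slk_choice` and `simulate`, the use of the perfect collision model, the single-run (unaveraged) realized regret, and the seeds are all assumptions made for this example.

```python
import math
import random


def slk_choice(mean, count, t, k):
    """SL(K) choice: among the K largest upper indices, take the smallest lower index."""
    bonus = [math.sqrt(2.0 * math.log(t) / c) for c in count]
    order = sorted(range(len(mean)), key=lambda i: mean[i] + bonus[i], reverse=True)
    return min(order[:k], key=lambda i: mean[i] - bonus[i])


def simulate(theta, n_users, horizon, fair, seed=0):
    """Realized regret of DLP (fair=False) or DLF (fair=True), perfect collision model."""
    rng = random.Random(seed)
    n_arms = len(theta)
    mean = [[0.0] * n_arms for _ in range(n_users)]
    count = [[0] * n_arms for _ in range(n_users)]
    total = 0.0
    for t in range(1, horizon + 1):
        # Bernoulli channel states: 1.0 = idle, 0.0 = busy
        x = [1.0 if rng.random() < p else 0.0 for p in theta]
        picks = []
        for m in range(1, n_users + 1):
            if t <= n_arms:
                arm = (m + t) % n_arms           # staggered initialization (0-based)
            else:
                k = ((m + t) % n_users) + 1 if fair else m
                arm = slk_choice(mean[m - 1], count[m - 1], t, k)
            picks.append(arm)
        for m, arm in enumerate(picks):
            c = count[m][arm]
            mean[m][arm] = (mean[m][arm] * c + x[arm]) / (c + 1)
            count[m][arm] = c + 1
            if picks.count(arm) == 1:            # reward only without collision
                total += x[arm]
    best = sum(sorted(theta, reverse=True)[:n_users])
    return horizon * best - total


if __name__ == "__main__":
    theta = [0.9, 0.8, 0.7, 0.6]
    for name, fair in (("DLP", False), ("DLF", True)):
        print(name, simulate(theta, n_users=2, horizon=20000, fair=fair))
```

Averaging over several seeds, as done for the figures, would smooth out the run-to-run variation of a single realization.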

As expected, DLF has the least regret, since one of the key 
features of DLF is that it does not favor any one user over 
another. The chance for each user to use any one of the M 
best channels is the same. It utilizes its observations on all the 
M best channels, and thus makes fewer mistakes when exploring. 
DLF-Naive not only has the greatest regret but also uses more 
storage. DLP has greater regret than DLF since user m has to 
spend time exploring the M - 1 channels among the M best 
channels other than channel $k = o^*_m$. Not only does this result 
in a loss of reward, it also results in collisions among users. To 
show this fact, we present the number of times that a channel 
is accessed by all M users up to time $n = 10^6$ in Figure 3. 

Figure 2 also explores the impact of increasing the number 
of channels N and secondary users M on the regret 
experienced by the different policies, with the minimum distance 
between arms $\Delta_{\min}$ fixed. Clearly, as the number of 
channels and secondary users increases, the regret, as well as 
the regret gap between different algorithms, increases. 



VII. Conclusion 

The problem of distributed multi-armed bandits is a fun- 
damental extension of the classic online learning framework 
that finds application in the context of opportunistic spectrum 
access for cognitive radio networks. We have made two key 
algorithmic contributions to this problem. For the case of 
prioritized users, we presented the first distributed policy that 
yields logarithmic regret over time without prior assumptions 
about the mean arm rewards. For the case of fair access, we 
presented a policy that yields order-optimal regret scaling in 
terms of the numbers of users and arms, which is also an 
improvement over prior results. 

Through simulations, we further show that the overall regret 
is lower for the fair access policy. In future work, we plan to 
undertake more comprehensive simulation based comparison 
of the proposed policy with previously proposed schemes, 
including over more realistic channel models. We are also 
interested in considering extensions of our distributed policies 
to multi-armed bandits with dependent arms, such as the 
combinatorial model considered in [6]. 